Maximal Repetitions in Written Texts: Finite Energy Hypothesis vs. Strong Hilberg Conjecture
نویسنده
چکیده
The article discusses two mutually-incompatible hypotheses about the stochastic mechanism of the generation of texts in natural language, which could be related to entropy. The first hypothesis, the finite energy hypothesis, assumes that texts are generated by a process with exponentially-decaying probabilities. This hypothesis implies a logarithmic upper bound for maximal repetition, as a function of the text length. The second hypothesis, the strong Hilberg conjecture, assumes that the topological entropy grows as a power law. This hypothesis leads to a hyperlogarithmic lower bound for maximal repetition. By a study of 35 written texts in German, English and French, it is found that the hyperlogarithmic growth of maximal repetition holds for natural language. In this way, the finite energy hypothesis is rejected, and the strong Hilberg conjecture is partly corroborated.
منابع مشابه
The Relaxed Hilberg Conjecture: A Review and New Experimental Support
The relaxed Hilberg conjecture states that the mutual information between two adjacent blocks of text in natural language grows as a power of the block length. The present paper reviews recent results concerning this conjecture. First, the relaxed Hilberg conjecture occurs when the texts repeatedly describe a random reality and Herdan’s law for facts repeatedly described in the texts is obeyed....
متن کاملHilberg’s Conjecture: an Updated FAQ
This note is a brief introduction to theoretical and experimental results concerning Hilberg’s conjecture, a hypothesis about natural language. The aim of the text is to provide a short guide to the literature. 1 What is Hilberg’s conjecture? In the early days of information theory, Shannon (1951) published estimates of conditional entropy for printed English. A few decades later, Hilberg (1990...
متن کاملHilberg’s Conjecture — a Challenge for Machine Learning
We review three mathematical developments linked with Hilberg’s conjecture—a hypothesis about the power-law growth of entropy of texts in natural language, which sets up a challenge for machine learning. First, considerations concerning maximal repetition indicate that universal codes such as the Lempel-Ziv code may fail to efficiently compress sources that satisfy Hilberg’s conjecture. Second,...
متن کاملOn Hilberg's Law and Its Links with Guiraud's Law
Hilberg (1990) supposed that finite-order excess entropy of a random human text is proportional to the square root of the text length. Assuming that Hilberg’s hypothesis is true, we derive Guiraud’s law, which states that the number of word types in a text is greater than proportional to the square root of the text length. Our derivation is based on some mathematical conjecture in coding theory...
متن کاملOn normalizers of maximal subfields of division algebras
Here, we investigate a conjecture posed by Amiri and Ariannejad claiming that if every maximal subfield of a division ring $D$ has trivial normalizer, then $D$ is commutative. Using Amitsur classification of finite subgroups of division rings, it is essentially shown that if $D$ is finite dimensional over its center then it contains a maximal subfield with non-trivial normalize...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید
ثبت ناماگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید
ورودعنوان ژورنال:
- Entropy
دوره 17 شماره
صفحات -
تاریخ انتشار 2015